fairness evaluation
Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization's Impact on ML Fairness
Arcolezi, Héber H., Alishahi, Mina, Bendoukha, Adda-Akram, Kaaniche, Nesrine
Machine learning (ML) algorithms rely heavily on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to fourfold. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
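A minimal sketch of the kind of audit described here: train a classifier before and after coarsening a quasi-identifier (a crude stand-in for $k$-anonymity-style generalization) and compare a group fairness metric such as statistical parity difference. The synthetic data, column names, and generalization rule are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch: how generalizing a quasi-identifier can shift a group
# fairness metric. Data, columns, and the binning rule are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "hours": rng.integers(10, 80, n),
    "sex": rng.integers(0, 2, n),  # protected attribute
})
# Outcome loosely correlated with the features.
df["y"] = ((df.age / 90 + df.hours / 80 + 0.2 * df.sex + rng.normal(0, 0.3, n)) > 1.1).astype(int)

def statistical_parity_difference(y_pred, group):
    """P(yhat=1 | group=1) - P(yhat=1 | group=0)."""
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

def train_and_audit(frame):
    X, y, g = frame[["age", "hours", "sex"]], frame["y"], frame["sex"]
    X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, g, random_state=0)
    pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    return statistical_parity_difference(pred, g_te.to_numpy())

baseline = train_and_audit(df)

# Coarsen 'age' into wide bins, mimicking the generalization used by anonymization.
anon = df.copy()
anon["age"] = (anon["age"] // 20) * 20
print(f"SPD original: {baseline:+.3f} | SPD generalized: {train_and_audit(anon):+.3f}")
```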
- North America > United States > California (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- (3 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
The Quest for Reliable Metrics of Responsible AI
Rampisela, Theresia Veronika, Maistro, Maria, Ruotsalo, Tuukka, Lioma, Christina
The development of Artificial Intelligence (AI), including AI in Science (AIS), should be done following the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet there has been less work on assessing the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems as a type of AI application and summarise their key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.
- Europe > Denmark > Capital Region > Copenhagen (0.06)
- North America > United States > New York > New York County > New York City (0.05)
Which Demographic Features Are Relevant for Individual Fairness Evaluation of U.S. Recidivism Risk Assessment Tools?
Nguyen, Tin Trung, Xu, Jiannan, Nguyen-Le, Phuong-Anh, Lazar, Jonathan, Braman, Donald, Daumé, Hal III, Jelveh, Zubin
Despite its constitutional relevance, the technical ``individual fairness'' criterion has not been operationalized in U.S. state or federal statutes/regulations. We conduct a human subjects experiment to address this gap, evaluating which demographic features are relevant for individual fairness evaluation of recidivism risk assessment (RRA) tools. Our analyses conclude that the individual similarity function should consider age and sex, but it should ignore race.
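A hedged sketch of what operationalizing this finding could look like: an individual similarity (distance) function that uses age and sex but deliberately ignores race, plugged into a Lipschitz-style individual fairness check. The weights, threshold, and toy risk score are assumptions, not the study's instrument.

```python
# Illustrative individual-fairness check where the similarity function considers
# age and sex but excludes race, as the study's results suggest. Weights and
# threshold are assumptions.
import numpy as np

def distance(a, b, age_scale=10.0):
    """Smaller = more similar. Race is deliberately excluded from the metric."""
    return abs(a["age"] - b["age"]) / age_scale + (a["sex"] != b["sex"])

def individually_fair(risk_score, x1, x2, lipschitz=1.0):
    """Similar individuals should receive similar risk scores (Lipschitz-style check)."""
    return abs(risk_score(x1) - risk_score(x2)) <= lipschitz * distance(x1, x2)

# Toy risk score and two defendants differing only in race.
risk = lambda x: 0.02 * x["age"] + 0.1 * x["sex"]
d1 = {"age": 30, "sex": 0, "race": "A"}
d2 = {"age": 30, "sex": 0, "race": "B"}
print(distance(d1, d2))                 # 0.0 -> treated as identical for fairness purposes
print(individually_fair(risk, d1, d2))  # True: identical scores are required
```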
- North America > United States > Maryland > Prince George's County > College Park (0.16)
- North America > United States > Illinois > Cook County > Chicago (0.05)
- North America > United States > California (0.05)
- (6 more...)
- Law > Statutes (1.00)
- Law > Government & the Courts (1.00)
- Government > Regional Government > North America Government > United States Government (0.95)
- Law > Civil Rights & Constitutional Law (0.94)
TrustSkin: A Fairness Pipeline for Trustworthy Facial Affect Analysis Across Skin Tone
Cabanas, Ana M., Pedro, Alma, Mery, Domingo
Understanding how facial affect analysis (FAA) systems perform across different demographic groups requires reliable measurement of sensitive attributes such as ancestry, often approximated by skin tone, which itself is highly influenced by lighting conditions. Using AffectNet and a MobileNet-based model, we assess fairness across skin tone groups defined by each method. Results reveal a severe underrepresentation of dark skin tones (2%), alongside fairness disparities in F1-score (up to 0.08) and TPR (up to 0.11) across groups. Grad-CAM analysis further highlights differences in model attention patterns by skin tone, suggesting variation in feature encoding. To support future mitigation efforts, we also propose a modular fairness-aware pipeline that integrates perceptual skin tone estimation, model interpretability, and fairness evaluation. These findings emphasize the relevance of skin tone measurement choices in fairness assessment and suggest that ITA-based evaluations may overlook disparities affecting darker-skinned individuals.
I. INTRODUCTION: Predictive algorithms and biometric systems are increasingly used in critical areas such as healthcare, security, and human-computer interaction [1]. However, these systems remain prone to bias arising from demographic imbalances in training data and algorithmic design flaws [1]-[3]. In computer vision applications like EmotionAI and Facial Affect Analysis (FAA), such biases often result in consistent performance disparities across attributes like age, sex, and skin tone [4]-[6]. Given the sensitive deployment of FAA in psychological evaluation, driver monitoring, and educational feedback [1], [7], [8], ensuring fairness, transparency, and robustness across demographic groups is essential.
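A minimal sketch of the group-wise audit this entry describes: per-skin-tone F1 and TPR, plus the largest gap between groups. Labels, predictions, and group assignments are synthetic placeholders, not outputs of the paper's AffectNet/MobileNet pipeline.

```python
# Per-group F1/TPR disparity audit with synthetic placeholder data.
import numpy as np
from sklearn.metrics import f1_score, recall_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 1000)
y_pred = np.where(rng.random(1000) < 0.85, y_true, 1 - y_true)  # ~85% accurate
skin_tone = rng.choice(["light", "medium", "dark"], 1000, p=[0.6, 0.38, 0.02])

per_group = {}
for g in np.unique(skin_tone):
    mask = skin_tone == g
    per_group[g] = {
        "F1": f1_score(y_true[mask], y_pred[mask]),
        "TPR": recall_score(y_true[mask], y_pred[mask]),  # true positive rate
        "n": int(mask.sum()),
    }

for g, m in per_group.items():
    print(g, m)
f1_gap = max(m["F1"] for m in per_group.values()) - min(m["F1"] for m in per_group.values())
print("F1 gap across groups:", round(f1_gap, 3))
```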
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- South America > Chile > Arica y Parinacota Region > Arica Province > Arica (0.04)
- Europe > Switzerland (0.04)
seeBias: A Comprehensive Tool for Assessing and Visualizing AI Fairness
Ning, Yilin, Ma, Yian, Liu, Mingxuan, Li, Xin, Liu, Nan
Fairness in artificial intelligence (AI) prediction models is increasingly emphasized to support responsible adoption in high-stakes domains such as health care and criminal justice. Guidelines and implementation frameworks highlight the importance of both predictive accuracy and equitable outcomes. However, current fairness toolkits often evaluate classification performance disparities in isolation, with limited attention to other critical aspects such as calibration. To address these gaps, we present seeBias, an R package for comprehensive evaluation of model fairness and predictive performance. seeBias offers an integrated evaluation across classification, calibration, and other performance domains, providing a more complete view of model behavior. It includes customizable visualizations to support transparent reporting and responsible AI implementation. Using public datasets from criminal justice and healthcare, we demonstrate how seeBias supports fairness evaluations, and uncovers disparities that conventional fairness metrics may overlook. The R package is available on GitHub, and a Python version is under development.
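The abstract stresses calibration as an aspect most fairness toolkits skip. Below is a Python sketch of a group-wise calibration check in that spirit; it is not the seeBias R API, and the scores, labels, and groups are synthetic illustrations.

```python
# Sketch of a group-wise calibration check (concept only, NOT the seeBias R API).
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(2)
n = 4000
group = rng.integers(0, 2, n)                              # e.g., a sensitive attribute
prob = np.clip(rng.beta(2, 2, n) + 0.10 * group, 0, 1)     # scores shifted for group 1
y = rng.binomial(1, np.clip(prob - 0.05 * group, 0, 1))    # group 1 is over-scored

for g in (0, 1):
    frac_pos, mean_pred = calibration_curve(y[group == g], prob[group == g], n_bins=5)
    ece = np.mean(np.abs(frac_pos - mean_pred))             # crude per-bin calibration error
    print(f"group {g}: mean |observed - predicted| = {ece:.3f}")
```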
- North America > United States (0.14)
- Asia > Singapore > Central Region > Singapore (0.05)
- North America > Canada (0.04)
- Law (1.00)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Epidemiology (0.93)
- Health & Medicine > Health Care Providers & Services (0.68)
FairEval: Evaluating Fairness in LLM-Based Recommendations with Personality Awareness
Sah, Chandan Kumar, Lian, Xiaoli, Xu, Tony, Zhang, Li
Recent advances in Large Language Models (LLMs) have enabled their application to recommender systems (RecLLMs), yet concerns remain regarding fairness across demographic and psychological user dimensions. We introduce FairEval, a novel evaluation framework to systematically assess fairness in LLM-based recommendations. FairEval integrates personality traits with eight sensitive demographic attributes, including gender, race, and age, enabling a comprehensive assessment of user-level bias. We evaluate models, including ChatGPT 4o and Gemini 1.5 Flash, on music and movie recommendations. FairEval's fairness metric, PAFS, achieves scores up to 0.9969 for ChatGPT 4o and 0.9997 for Gemini 1.5 Flash, with disparities reaching 34.79 percent. These results highlight the importance of robustness in prompt sensitivity and support more inclusive recommendation systems.
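A generic sketch of attribute-sensitivity probing for an LLM recommender: issue prompts that differ only in a sensitive attribute and score how consistent the returned lists are. This is not the paper's PAFS metric, and `get_recommendations` is a hypothetical stand-in for a RecLLM call.

```python
# Consistency of recommendation lists across sensitive-attribute prompt variants.
# NOT the PAFS metric from the paper; get_recommendations() is a hypothetical stub.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def attribute_consistency(get_recommendations, template, values, k=10):
    """Mean pairwise overlap of top-k lists across sensitive-attribute variants."""
    lists = [get_recommendations(template.format(attr=v))[:k] for v in values]
    pairs = list(combinations(lists, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Usage with a stubbed recommender that ignores the prompt (perfect consistency -> 1.0).
stub = lambda prompt: ["song_%d" % i for i in range(10)]
score = attribute_consistency(stub, "Recommend songs for a {attr} listener.", ["male", "female", "non-binary"])
print(score)
```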
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Bristol (0.04)
- (4 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Fairness Evaluation with Item Response Theory
Xu, Ziqi, Kandanaarachchi, Sevvandi, Ong, Cheng Soon, Ntoutsi, Eirini
Item Response Theory (IRT) has been widely used in educational psychometrics to assess student ability, as well as the difficulty and discrimination of test questions. In this context, discrimination specifically refers to how effectively a question distinguishes between students of different ability levels, and it does not carry any connotation related to fairness. In recent years, IRT has been successfully used to evaluate the predictive performance of Machine Learning (ML) models, but this paper marks its first application in fairness evaluation. In this paper, we propose a novel Fair-IRT framework to evaluate a set of predictive models on a set of individuals, while simultaneously eliciting specific parameters, namely, the ability to make fair predictions (a feature of predictive models), as well as the discrimination and difficulty of individuals that affect the prediction results. Furthermore, we conduct a series of experiments to comprehensively understand the implications of these parameters for fairness evaluation. Detailed explanations for item characteristic curves (ICCs) are provided for particular individuals. We propose the flatness of ICCs to disentangle the unfairness between individuals and predictive models. The experiments demonstrate the effectiveness of this framework as a fairness evaluation tool. Two real-world case studies illustrate its potential application in evaluating fairness in both classification and regression tasks. Our paper aligns well with the Responsible Web track by proposing a Fair-IRT framework to evaluate fairness in ML models, which directly contributes to the development of a more inclusive, equitable, and trustworthy AI.
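For readers unfamiliar with IRT, the standard two-parameter logistic (2PL) model underlying this framework is shown below. In the fairness reading described above, theta plays the role of a predictive model's ability to make fair predictions, and (a, b) are an individual's discrimination and difficulty; the snippet only evaluates item characteristic curves (ICCs), it is not the paper's estimation procedure.

```python
# Two-parameter logistic (2PL) IRT model; evaluates ICCs only.
import numpy as np

def icc(theta, a, b):
    """P(fair outcome | model ability theta, individual discrimination a, difficulty b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)     # a range of model abilities
print(icc(theta, a=1.5, b=0.0))   # steep ICC: this individual separates models well
print(icc(theta, a=0.2, b=0.0))   # flat ICC: the outcome barely depends on the model,
                                  # the "flatness" signal used to attribute unfairness
                                  # to the individual rather than the predictive model
```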
- Oceania > Australia > Victoria > Melbourne (0.04)
- Asia (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- (3 more...)
- Law (1.00)
- Education > Curriculum > Subject-Specific Education (0.47)
Ethical AI Governance: Methods for Evaluating Trustworthy AI
McCormack, Louise, Bendechache, Malika
Trustworthy Artificial Intelligence (TAI) integrates ethics that align with human values, looking at their influence on AI behaviour and decision-making. Primarily dependent on self-assessment, TAI evaluation aims to ensure ethical standards and safety in AI development and usage. This paper reviews the current TAI evaluation methods in the literature and offers a classification, contributing to understanding self-assessment methods in this field.
- North America > United States (0.14)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Europe > Italy > Campania > Naples (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government (0.68)
FairLENS: Assessing Fairness in Law Enforcement Speech Recognition
Wang, Yicheng, Cusick, Mark, Laila, Mohamed, Puech, Kate, Ji, Zhengping, Hu, Xia, Wilson, Michael, Spitzer-Williams, Noah, Wheeler, Bryan, Ibrahim, Yasser
Automatic speech recognition (ASR) techniques have become powerful tools, enhancing efficiency in law enforcement scenarios. To ensure fairness for demographic groups in different acoustic environments, ASR engines must be tested across a variety of speakers in realistic settings. However, describing the fairness discrepancies between models with confidence remains a challenge. Meanwhile, most public ASR datasets are insufficient to perform a satisfying fairness evaluation. To address the limitations, we built FairLENS - a systematic fairness evaluation framework. We propose a novel and adaptable evaluation method to examine the fairness disparity between different models. We also collected a fairness evaluation dataset covering multiple scenarios and demographic dimensions. Leveraging this framework, we conducted fairness assessments on 1 open-source and 11 commercially available state-of-the-art ASR models. Our results reveal that certain models exhibit more biases than others, serving as a fairness guideline for users to make informed choices when selecting ASR models for a given real-world scenario. We further explored model biases towards specific demographic groups and observed that shifts in the acoustic domain can lead to the emergence of new biases.
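A minimal sketch of the underlying measurement for this kind of audit: word error rate (WER) computed per demographic group. It is not FairLENS's statistical evaluation method or dataset; the transcripts and group labels are made-up examples.

```python
# Per-group word error rate (WER) for an ASR fairness audit.
# Requires: pip install jiwer
from collections import defaultdict
import jiwer

samples = [  # (group, reference transcript, ASR hypothesis)
    ("group_A", "officer requesting backup on fifth street", "officer requesting backup on fifth street"),
    ("group_A", "suspect heading north on main", "suspect heading north on main"),
    ("group_B", "officer requesting backup on fifth street", "officer requesting back up on fifth treat"),
    ("group_B", "suspect heading north on main", "suspect heading north of maine"),
]

by_group = defaultdict(lambda: ([], []))
for group, ref, hyp in samples:
    by_group[group][0].append(ref)
    by_group[group][1].append(hyp)

for group, (refs, hyps) in by_group.items():
    print(group, "WER =", round(jiwer.wer(refs, hyps), 3))
```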
- Europe > United Kingdom > England (0.15)
- North America > United States > New York (0.05)
- North America > United States > Texas (0.04)
- (3 more...)
- Research Report > New Finding (0.66)
- Research Report > Experimental Study (0.46)
Fairness Evaluation for Uplift Modeling in the Absence of Ground Truth
Kadioglu, Serdar, Michalsky, Filip
The acceleration in the adoption of AI-based automated decision-making systems poses a challenge for evaluating the fairness of algorithmic decisions, especially in the absence of ground truth. When designing interventions, uplift modeling is used extensively to identify candidates that are likely to benefit from treatment. However, these models are particularly difficult to evaluate for fairness because ground truth on the outcome measure is missing: a candidate cannot be in both treatment and control simultaneously. In this article, we propose a framework that overcomes the missing ground truth problem by generating surrogates to serve as a proxy for counterfactual labels of uplift modeling campaigns. We then leverage the surrogate ground truth to conduct a more comprehensive binary fairness evaluation. We show how to apply the approach in a comprehensive study from a real-world marketing campaign for promotional offers and demonstrate how it enhances fairness evaluation.
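One common way to build such surrogates, sketched below under stated assumptions: fit an outcome model on the control group, use its predictions as proxy counterfactual labels for treated individuals, then run an ordinary group-wise check. This illustrates the concept rather than the authors' exact surrogate construction.

```python
# Surrogate counterfactual labels for auditing uplift targeting (concept sketch).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 6000
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "sensitive": rng.integers(0, 2, n),
    "treated": rng.integers(0, 2, n),
})
# Observed outcome (only one potential outcome is ever seen per person).
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(df.x1 + 0.5 * df.treated - 0.3 * df.sensitive))))

control, treated = df[df.treated == 0], df[df.treated == 1]
features = ["x1", "x2", "sensitive"]

# Counterfactual "no-treatment" labels for treated individuals via a control-only model.
outcome_model = GradientBoostingClassifier().fit(control[features], control["y"])
surrogate_y0 = outcome_model.predict(treated[features])

# With surrogates in hand, targeting can be audited like a binary classifier,
# e.g. proxy uplift (observed outcome minus surrogate baseline) per sensitive group.
uplift_proxy = treated["y"].to_numpy() - surrogate_y0
for g in (0, 1):
    mask = treated["sensitive"].to_numpy() == g
    print(f"sensitive={g}: mean proxy uplift = {uplift_proxy[mask].mean():+.3f}")
```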
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (9 more...)
- Research Report > Experimental Study (0.70)
- Research Report > New Finding (0.46)
- Marketing (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Information Technology (0.93)